Reliable Evaluations of URL Normalization

Authors

  • Sung Jin Kim
  • Hyo Sook Jeong
  • Sang Ho Lee
Abstract

URL normalization is the process of transforming URL strings into a canonical form. Through this process, duplicate URL representations of web pages can be reduced significantly. A number of normalization methods exist. In this paper, we describe four metrics for evaluating normalization methods. The reliability and consistency of a URL are also considered in our evaluation. With the proposed metrics, we evaluate seven normalization methods. The evaluation results on over 25 million URLs, extracted from the web, are reported in


Similar resources

Normalization of qPCR array data: a novel method based on procrustes superimposition

MicroRNAs (miRNAs) are short, endogenous non-coding RNAs that function as guide molecules to regulate transcription of their target messenger RNAs. Several methods including low-density qPCR arrays are being increasingly used to profile the expression of these molecules in a variety of different biological conditions. Reliable analysis of expression profiles demands removal of technical variati...


Browser Extension TO Removing Dust Using Sequence Alignment and Content Matching

If the documents at two URLs are similar, the URLs are called DUST. Detecting near-duplicate documents is likewise complex. The content of duplicate documents is similar, but with small differences. Different URLs with sa...


Normalization of Parents’ Response to Children’s Positive Emotions Scale

This study evaluated the normalization of the Persian version of the Parents’ Response to Children’s Positive Emotions Scale (PRCPS). For evaluating reliability and validity of this scale through random sampling, 400 mothers of 4-7-year-old children completed the PRCPS and Cognitive Emotion Regulation Questionnaire (CERQ). Evaluating internal reliability of PRCPS subscales by Cronba...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...


Belief Functions on MV-algebras of Fuzzy Sets: An Overview

Belief functions are the measure theoretical objects Dempster-Shafer evidence theory is based on. They are in fact totally monotone capacities, and can be regarded as a special class of measures of uncertainty used to model an agent's degrees of belief in the occurrence of a set of events by taking into account different bodies of evidence that support those beliefs. In this chapter we present ...




Publication date: 2006